Levy Equilibrium Optimizer algorithm for the DNA storage code set

The generation of massive data places higher demands on storage technology. DNA storage is a new storage technology that uses the biological macromolecule DNA as the information carrier. Compared with traditional silicon-based storage, DNA storage offers large capacity, high density, low energy consumption and high durability. DNA coding aims to store data with as few base sequences as possible and without errors. Coding is a key technology in DNA storage, and its results directly affect storage performance and the integrity of data reading and writing. In this paper, a Levy Equilibrium Optimizer (LEO) algorithm is proposed to construct a DNA storage code set that satisfies combinatorial constraints. The performance of the proposed algorithm is tested on 13 benchmark functions, and 4 new global optima are obtained. DNA storage code sets are then constructed under the same constraints as previous work, and the lower bound of the DNA storage code set is improved by 4–13%.


Introduction
With the rapid progress of science and technology and the growing popularity of high-speed networks, network data, mobile data, social data and other digital information are increasing exponentially. According to IDC, the total amount of global data will reach 175 ZB by 2025. Storage devices based on physical media cannot cope with this explosive growth. How to store such massive data has therefore become a key problem for the long-term sustainable development of information technology. As a new storage method, DNA data storage plays an important role in saving storage energy and promoting the development of big data storage. The idea of using DNA molecules to store information appeared as early as the 1960s. Because DNA information was difficult to read and write, it was not until 1988 that Davis [1] began to use DNA for storage, and the amount of information stored was very small. In recent years, with the rapid reduction in the cost of base synthesis and the development of DNA sequencing technology, more practical work has appeared. In 2012, Church et al. [2] adopted a binary model to encode and store digital information, using bases A and C to represent binary 0 and bases G and T to represent binary 1. To reduce the error rate, the encoded information must avoid runs of 4 or more identical consecutive bases and keep the GC content stable. The encoded DNA sequence is synthesized as several short DNA fragments. The content of the synthesized fragments is then read out by second-generation sequencing and converted back into short data fragments, and the position of each fragment in the whole file is found from its barcode to recover the original file. Goldman et al. [3] adopted a ternary coding model based on Church's work, in which each unit of information takes one of the states 0, 1 and 2.
They used homopolymer-free DNA sequences to encode ternary digital information. The scheme proposed by Goldman has some error correction ability, so it reduces the error rate compared with Church's work when reading the information stored in DNA. Yazdi et al. [4] described the first DNA-based storage architecture that allowed random access to blocks of data and the rewriting of information stored anywhere within the blocks. The system is based on new constrained coding techniques and corresponding DNA editing methods to ensure data reliability, specificity and access sensitivity. Ceze et al. [5] proposed an architecture for a DNA-based archival storage system. The team encoded data from four image files into the nucleotide sequences of synthetic DNA fragments, and they were also able to retrieve the correct nucleotide sequences from a larger pool of DNA and reconstruct the images without losing a single byte of information. Shortly afterward, Microsoft announced that it had saved about 200 megabytes of data using DNA storage technology, including "War and Peace" and 99 other classic literary works. In 2017, Erlich and Zielinski [6] proposed the DNA Fountain coding method to achieve error-free storage at a coding rate of 1.6 bit/nt. In the same year, Shipman et al. [7] introduced images and videos encoded as DNA sequences into the genome of Escherichia coli and read the corresponding images and videos back from the genome of living bacterial cells. In 2018, the Grass team [8] encoded and stored 35 different files in over 13 million DNA oligonucleotides and could recover each file without loss using random access. A large primer library was designed and validated, which allows every file stored in the DNA to be recovered independently. The potential of DNA as a long-term storage medium has thus been preliminarily demonstrated [9][10][11][12].
DNA coding is a key technology in DNA storage, which aims to store data with as few base sequences as possible and without error. The results of DNA coding directly affect storage performance and the integrity of data reading and writing, so reasonable and efficient coding is very important for the whole DNA storage system. In 2012, Church et al. [2] used DNA synthesis technology and second-generation sequencing technology to encode 0.65 MB of abiotic information into DNA sequences, which was the first application of the binary model and achieved an information storage density of 0.83 bit/nt. As research on DNA coding deepened, Ross et al. [13] reported that substitution and deletion errors increase significantly when the run length of a homopolymer exceeds 6. In addition, DNA strands with too high or too low GC content are more prone to synthesis and sequencing errors. For these reasons, Bornholt et al. [14] adopted an XOR coding principle to improve Church's coding scheme, which not only realized random access but also achieved a storage density of 0.85 bit/nt. In terms of error correction coding, Blawat et al. [15] introduced forward error correction to achieve a storage density of 1.08 bit/nt. Not long after, Yazdi et al. [4] removed the need for full sequencing when reading data: they designed a coding method that achieves random access through address-based addressing, and a platform for efficient sequencing through iterative alignment and deletion error check codes, thus achieving high storage density. In 2017, Erlich et al. [6] designed a "fountain code" for information storage, which is highly robust and efficient because it avoids strongly biased GC content and the generation of homopolymers.
This was the first time that fountain codes, originally introduced in communication coding, were used to bring the net information density close to the Shannon limit, and it showed that a dedicated error detection/correction algorithm is not strictly necessary, because a similar effect can be achieved by screening sequences. Jeong et al. [16] further improved the fountain code and obtained better decoding results by clustering. Wang et al. [17] designed an encoding method consisting of repeat-accumulate (RA) codes and an efficient hybrid mapping scheme to achieve a storage density of 1.67 bit/nt. Zhang et al. used combinatorial constraints to screen DNA storage codes [18][19][20], and used heuristic algorithms such as CLGBO [21] and NOL-HHO [20] to construct DNA storage code sets of higher quality. Yehezkeli et al. [22] consider noise introduced solely by uniform tandem duplication and exploit the relation to constant-weight integer codes in the Manhattan metric. The existence of full-rate reconstruction codes is proved by bounding the intersection of the cross-polytope with hyperplanes [23], and a construction for a class of reconstruction codes is given. Lenz et al. present a storage model in which data are represented as unordered sets of sequences, deriving a Gilbert-Varshamov lower bound and a sphere-packing upper bound on the achievable size of error-correcting codes [24]. In 2021, Zan et al. [25] proposed a hierarchical error-correction strategy for text DNA storage based on a divide-and-conquer algorithm to achieve lossless storage of text.
Heuristic algorithms can provide a feasible solution for each instance of a combinatorial optimization problem at an acceptable cost, although the deviation of that feasible solution from the optimal solution cannot generally be predicted. Classical heuristic algorithms include the genetic algorithm [26] and particle swarm optimization [27]. Newer ones include Monarch Butterfly Optimization (MBO) [28], the Slime Mould Algorithm (SMA) [29], the Moth Search Algorithm (MSA) [30], hunger games search (HGS) [31], the RUNge Kutta method (RUN) [32], the colony predation algorithm (CPA) [33], the weighted mean of vectors (INFO) [34] and Harris Hawks Optimization (HHO) [20]. They are widely used to optimize complex engineering problems such as distribution systems [35], power flow [36] and power grids [37]. They are also often applied to optimization problems in the biological field [38,39].
The DNA storage coding problem can be treated as the problem of screening DNA codes that satisfy combinatorial constraints. However, because evaluating the constraints is computationally expensive, traditional algorithms are too inefficient. To address the low base utilization and low coding quality of existing DNA storage coding methods, this work constructs a DNA storage code set that meets the combinatorial constraints using the improved LEO (Levy Equilibrium Optimizer) algorithm, ensuring both coding efficiency and coding quality. Adding Levy flight to the EO algorithm reduces the likelihood that the original algorithm falls into a local optimum and improves its convergence speed, which makes it possible to construct larger sets of DNA storage codes that satisfy the constraints. The constructed code set satisfies the Hamming distance constraint, the GC content constraint and the No-runlength constraint, and has some error correction capability. It also offers coding advantages such as high robustness, low coding complexity and shorter coding time.

Hamming distance constraint
The Hamming distance is used in many research areas, such as coding theory, to measure the similarity of two codewords. In DNA storage, a small Hamming distance [40] indicates that two different DNA codewords share many identical bases, i.e., an increased possibility of non-specific hybridization. For two different DNA codewords $j$ and $k$ of length $n$, $HD(j, k)$ denotes the number of positions $i$ at which $j$ and $k$ carry different bases. The Hamming distance constraint is expressed as $HD(j, k) \ge d$. The Hamming distance is calculated as follows:
\[ HD(j, k) = \sum_{i=1}^{n} h(j_i, k_i), \qquad h(j_i, k_i) = \begin{cases} 0, & j_i = k_i \\ 1, & j_i \ne k_i \end{cases} \]
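The Hamming distance constraint can be illustrated with a short Python sketch. The function names `hamming_distance` and `satisfies_hamming` are hypothetical and are not taken from the paper; the sketch only assumes that codewords are equal-length strings over the alphabet A, C, G, T.

```python
def hamming_distance(j: str, k: str) -> int:
    """Number of positions at which two equal-length codewords differ."""
    if len(j) != len(k):
        raise ValueError("codewords must have the same length")
    return sum(a != b for a, b in zip(j, k))


def satisfies_hamming(candidate: str, code_set: list[str], d: int) -> bool:
    """True if the candidate is at distance >= d from every codeword in the set."""
    return all(hamming_distance(candidate, c) >= d for c in code_set)


if __name__ == "__main__":
    print(hamming_distance("ACGT", "ACCA"))        # differs at positions 3 and 4
    print(satisfies_hamming("ACGT", ["ACCA"], 3))  # distance 2 is below d = 3
```

A candidate is accepted only if it keeps the required distance from every codeword already in the set, so the check grows linearly with the size of the set.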

GC content constraint
A, T, C and G are the four bases that constitute the structure of DNA. A and T are complementary and can pair to form a double-stranded structure, as can G and C. In actual biological operations, sequences with extreme GC content are unstable, so sequences are generally designed with a GC content of 40%-60%, which is the GC content constraint [41].
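A minimal sketch of the GC content check follows. The helper names and the default bounds of 0.4 and 0.6 are assumptions that mirror the 40%-60% range stated above; they are not code from the paper.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_gc(seq: str, low: float = 0.4, high: float = 0.6) -> bool:
    """True if the GC fraction lies in the closed interval [low, high]."""
    return low <= gc_content(seq) <= high


if __name__ == "__main__":
    for s in ("ACGT", "AAAA", "GGCC"):
        print(s, gc_content(s), satisfies_gc(s))
```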

No-runlength constraint
Runs of identical consecutive bases make the molecular structure of the whole sequence unstable, and the hybridization reaction becomes difficult to control. Errors are especially likely when reading long homopolymers. Therefore, in the coding process, we use the No-runlength constraint [42] to avoid such errors. For example, in TCCCCAC the base C is repeated, so the long run of C is easily misread as a shorter run during synthesis and sequencing, which increases the error rate of the stored information and decreases read and write coverage. For a codeword $L = (l_1, l_2, l_3, \ldots, l_n)$ of length $n$, and for any $i \in [1, n-1]$:
\[ l_i \ne l_{i+1} \]
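The No-runlength check can be sketched as follows, assuming the constraint forbids any two adjacent identical bases. The function names are hypothetical; `longest_run` is included only to show why the TCCCCAC example fails.

```python
def satisfies_no_runlength(seq: str) -> bool:
    """True if no two adjacent bases in the sequence are identical."""
    return all(a != b for a, b in zip(seq, seq[1:]))


def longest_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    best = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best


if __name__ == "__main__":
    print(satisfies_no_runlength("ACGTAC"), longest_run("ACGTAC"))
    print(satisfies_no_runlength("TCCCCAC"), longest_run("TCCCCAC"))
```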

Equilibrium optimizer
The Equilibrium Optimizer (EO) is inspired by physical phenomena, in particular the dynamic mass balance of a well-mixed control volume. The mass balance equation describes the dynamic equilibrium process that governs the concentration of a non-reactive substance in the control volume, and it provides the fundamental physical explanation for the conservation of mass entering, leaving and generated in that volume. More detailed information on the mixed dynamic mass balance process can be found in the original paper [43]. The steps of the EO algorithm are as follows:
Step 1: Initialization. Initialization is performed according to the parameters of the search space, and the initial concentrations are constructed from the number and dimension of the particles by uniform random initialization:
\[ \vec{v}_i = c_{min} + \vec{r}_1 \, (c_{max} - c_{min}), \qquad i = 1, 2, \ldots, n \]
Here $\vec{v}_i$ represents the concentration vector of particle $i$, $c_{max}$ and $c_{min}$ represent the upper and lower bounds of each dimension, $\vec{r}_1$ is a random vector in [0,1], and $n$ is the number of particles.
Step 2: Balance pool and candidate pool. Swarm intelligence algorithms such as the EO algorithm, particle swarm optimization and the ant colony algorithm are population-based. These algorithms divide the search process into two phases, exploration and exploitation, and each algorithm approaches them differently. Every heuristic algorithm has an optimization objective that follows from its metaphor. For example, the ant colony algorithm searches in the way ants search for food, whereas the EO algorithm searches for the equilibrium state of the system. However, during the optimization process of the EO algorithm the concentration level that corresponds to the equilibrium state is not known, so the equilibrium state is artificially defined by the four best particles found so far and their average. These five particles help the EO algorithm to perform better in exploration and exploitation, and together they form the equilibrium pool:
\[ \vec{p}_{eq,pool} = \left[ \vec{p}_{eq(1)}, \vec{p}_{eq(2)}, \vec{p}_{eq(3)}, \vec{p}_{eq(4)}, \vec{p}_{eq(avg)} \right] \qquad (5) \]
Step 3: Update of the concentration. The EO algorithm needs a reasonable balance between exploitation and exploration, and this balance is controlled by the turnover term $\vec{F}$. In a control volume the turnover rate varies with time; taking $\vec{l}$ to be a random vector in [0,1], the turnover term is
\[ \vec{F} = e^{-\vec{l}\,(t - t_0)} \]
where the time $t$ is defined as a function of the iteration count:
\[ t = \left( 1 - \frac{iter}{t_{max}} \right)^{a_2 \frac{iter}{t_{max}}} \]
Here $iter$ and $t_{max}$ represent the current iteration number and the maximum iteration number respectively, and $a_2$ is a constant that controls the exploitation capability. In addition, the parameter $a_1$ is designed to enhance the diversity and exploration capacity of the population, as follows:
\[ t_0 = \frac{1}{\vec{l}} \ln\!\left( -a_1 \, \mathrm{sign}(\vec{r} - 0.5) \left[ 1 - e^{-\vec{l}\,t} \right] \right) + t, \qquad \vec{F} = a_1 \, \mathrm{sign}(\vec{r} - 0.5) \left[ e^{-\vec{l}\,t} - 1 \right] \]
where $\vec{r}$ is a random vector in [0,1]. The generation rate $\vec{R}$ is another term used to improve exploitation, and its formula is as follows:
\[ \vec{R} = \vec{R}_0 \, e^{-\vec{l}\,(t - t_0)} = \vec{R}_0 \, \vec{F}, \qquad \vec{R}_0 = \overrightarrow{RCP} \left( \vec{p}_{eq} - \vec{l}\,\vec{v} \right), \qquad \overrightarrow{RCP} = \begin{cases} 0.5\, r_1, & r_2 \ge GP \\ 0, & r_2 < GP \end{cases} \]
where $\vec{l}$ is a random vector in [0,1], $r_1$ and $r_2$ are random numbers in [0,1], $GP$ is the generation probability, and $\overrightarrow{RCP}$ is the control parameter of the generation rate, which determines whether the generation rate is applied in the update. Finally, the update equation of EO is as follows:
\[ \vec{v} = \vec{p}_{eq} + \left( \vec{v} - \vec{p}_{eq} \right) \vec{F} + \frac{\vec{R}}{\vec{l}\,V} \left( 1 - \vec{F} \right) \]
Here $V$ is set to 1 and $\vec{p}_{eq}$ is drawn at random from the equilibrium pool. For a more detailed introduction to the EO algorithm, please refer to Faramarzi [43].
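The three steps above can be sketched in plain Python. This is a simplified illustration that follows the EO formulation of [43], not the authors' implementation: the function name `eo_minimize`, the default values `a1=2`, `a2=1`, `gp=0.5`, the bound clamping and the way the elite pool is maintained are all assumptions, and the memory-saving step of the original EO is omitted.

```python
import math
import random


def eo_minimize(f, dim, c_min, c_max, n=30, t_max=200,
                a1=2.0, a2=1.0, gp=0.5, volume=1.0, seed=0):
    """Simplified Equilibrium Optimizer for minimizing f over [c_min, c_max]^dim."""
    rng = random.Random(seed)
    # Step 1: uniform random initialization of the particle concentrations.
    pop = [[c_min + rng.random() * (c_max - c_min) for _ in range(dim)]
           for _ in range(n)]
    pool = []  # (fitness, vector) pairs for the four best particles seen so far
    for it in range(t_max):
        # Step 2: refresh the equilibrium pool (four best plus their average).
        scored = pool + [(f(v), v[:]) for v in pop]
        scored.sort(key=lambda fv: fv[0])
        pool = scored[:4]
        avg = [sum(p[1][k] for p in pool) / len(pool) for k in range(dim)]
        candidates = [p[1] for p in pool] + [avg]
        # Step 3: time term, turnover F, generation rate R, concentration update.
        t = (1.0 - it / t_max) ** (a2 * it / t_max)
        for i, v in enumerate(pop):
            p_eq = rng.choice(candidates)
            r1, r2 = rng.random(), rng.random()
            rcp = 0.5 * r1 if r2 >= gp else 0.0  # generation-rate control parameter
            new = []
            for k in range(dim):
                l = 1.0 - rng.random()           # random value in (0, 1], never zero
                sign = 1.0 if rng.random() >= 0.5 else -1.0
                turnover = a1 * sign * (math.exp(-l * t) - 1.0)
                rate = rcp * (p_eq[k] - l * v[k]) * turnover
                x = (p_eq[k] + (v[k] - p_eq[k]) * turnover
                     + rate / (l * volume) * (1.0 - turnover))
                new.append(min(max(x, c_min), c_max))  # keep inside the bounds
            pop[i] = new
    scored = pool + [(f(v), v[:]) for v in pop]
    best_f, best_v = min(scored, key=lambda fv: fv[0])
    return best_v, best_f


if __name__ == "__main__":
    sphere = lambda x: sum(xi * xi for xi in x)
    vec, val = eo_minimize(sphere, dim=5, c_min=-100.0, c_max=100.0)
    print(val)
```

Because the pool is carried over between iterations, the best value returned can never be worse than the best particle of the initial population.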

Levy Equilibrium Optimizer
Although the EO algorithm uses parameters such as $a_1$ to enhance the exploration ability of the population, its population diversity still decreases in later iterations. This increases the probability of falling into a local optimum, and the problem may be exacerbated in practice by more complex conditions. Moreover, the individual update depends mainly on the magnitude of the turnover, after which individuals are updated randomly according to the current global optimum and the equilibrium pool. Since the global optimum found early in the run is often far from the true value, this strategy increases the probability of the algorithm falling into a local optimum and may slow its convergence. A study by Reynolds et al. [44] showed that Drosophila flies explore their environment and search for food through a series of straight flight paths that are often interspersed with abrupt right-angle turns. An intermittent scale-free search model, called Levy flight, was proposed based on this scale-free flight of Drosophila. Researchers have applied the model to optimization and optimal search, and preliminary results showed good search performance [45]. In the LEO algorithm, a Levy flight update strategy replaces the random update based on the current global optimum, which reduces the influence of individual pool members on the update mechanism. Levy flight is applied to the pool in the late iterations of the EO algorithm, both to accelerate convergence and to jump out of local optima. It also processes the output of the EO algorithm, which expands the search scope and yields a larger code set.
After the set $S$ is initialized, every code in $S$ and $S_{EO}$ is checked one by one against the combinatorial constraints. The flow chart of the LEO algorithm is shown in Fig 1.
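The text does not state how the Levy step is generated, so the sketch below assumes Mantegna's algorithm, a common way to draw Levy-distributed step lengths. The exponent `beta=1.5`, the step `scale=0.01`, the update rule relative to the current best particle and all function names are hypothetical choices for illustration only.

```python
import math
import random


def levy_sigma(beta: float = 1.5) -> float:
    """Standard deviation of the numerator Gaussian in Mantegna's algorithm."""
    num = math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
    den = math.gamma((1.0 + beta) / 2.0) * beta * 2.0 ** ((beta - 1.0) / 2.0)
    return (num / den) ** (1.0 / beta)


def levy_step(rng: random.Random, beta: float = 1.5) -> float:
    """Draw one heavy-tailed step length u / |v|^(1/beta)."""
    u = rng.gauss(0.0, levy_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # guard against division by zero
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1.0 / beta)


def levy_update(v, best, c_min, c_max, rng, scale=0.01, beta=1.5):
    """Perturb particle v with a Levy step scaled by its offset from the best particle."""
    new = []
    for vk, bk in zip(v, best):
        x = vk + scale * levy_step(rng, beta) * (vk - bk)
        new.append(min(max(x, c_min), c_max))  # keep inside the bounds
    return new


if __name__ == "__main__":
    rng = random.Random(1)
    print(levy_update([1.0, -2.0, 3.0], [0.0, 0.0, 0.0], -100.0, 100.0, rng))
```

Most steps drawn this way are small, with occasional very long jumps, which is the property that lets a late-stage population escape a local optimum.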

Benchmark function
To verify the performance of the LEO algorithm, benchmark test functions are used in this paper. Benchmarking was carried out on the 13 widely used benchmark functions [46] listed in Tables 1 and 2. Different algorithms target different types of real-world problems, and it is uncertain whether any one algorithm achieves the best results on every problem. Since the test functions are simulations of real problems, different algorithms may suit different test functions. Thirteen benchmark functions were chosen, comprising seven high-dimensional unimodal functions and six high-dimensional multimodal functions. These 13 functions reflect most real-world problems, and testing on them provides a useful indication of the performance of the algorithm. For fairness and to improve the reliability and rigour of the experiments, the domain and the number of iterations of the test functions are fixed. To illustrate the convergence process of LEO, Fig 2 shows that in the initial stage LEO and EO maintain the same iteration efficiency, but in the later stage LEO converges faster and comes closer to the global optimum. This is because Levy flight helps LEO jump out of local optima and improves the iteration speed.
After running each of the 13 test functions 30 times, the mean and standard deviation of the results were compared with those of the original algorithm and other representative algorithms. We selected the EO, PSO, GWO, GA, GSA and SSA algorithms for comparison. EO is recent work from Faramarzi et al. [43], GA is the earliest and a well-performing evolutionary algorithm, PSO is a heuristic algorithm that mimics group behaviour, and GSA is a physics-inspired algorithm. The maximum number of iterations for these algorithms is set to 500. The EO, PSO, GWO, GA, GSA and SSA results are taken from Faramarzi's work [43]. Tables 1 and 2 list the test functions used. F1-F7 are high-dimensional unimodal functions with a single global optimum, so they are usually used for general testing of algorithms. F8-F13 each have one global optimum and several local optima, and the number of local optima increases with the dimension. This makes them harder for heuristic algorithms and better reflects an algorithm's optimization speed and its ability to jump out of local optima. Tables 3 and 4 show LEO's performance on the 13 test functions, and in most cases LEO achieved the best results in the table. However, on complex functions such as F12 and F13, LEO's performance is unsatisfactory, possibly because the benefit of Levy flight is limited on these multimodal functions, so the optimal solution is not obtained. On high-dimensional unimodal functions such as F1-F5, LEO finds the global optimal solution 0. To further assess the statistical significance of the results, we conducted a Wilcoxon test on the LEO algorithm, and in most cases LEO passed statistical verification. The results are shown in Table 5.
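The evaluation protocol (repeated runs, then mean and standard deviation) can be sketched as follows. The sketch assumes that the suite includes the sphere function on [-100, 100] and the Rastrigin function on [-5.12, 5.12], as in the commonly used benchmark set; the tables themselves are not reproduced here. `random_search` is a placeholder optimizer used only to make the example runnable, not LEO.

```python
import math
import random
import statistics


def sphere(x):
    """Unimodal test function; global minimum 0 at the origin."""
    return sum(xi * xi for xi in x)


def rastrigin(x):
    """Multimodal test function; global minimum 0 at the origin."""
    return sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) + 10.0 for xi in x)


def random_search(f, dim, low, high, evals, rng):
    """Placeholder optimizer: best value among uniformly sampled points."""
    best = math.inf
    for _ in range(evals):
        x = [rng.uniform(low, high) for _ in range(dim)]
        best = min(best, f(x))
    return best


def summarize(f, dim, low, high, runs=30, evals=500, seed=0):
    """Mean and sample standard deviation of the best value over repeated runs."""
    rng = random.Random(seed)
    results = [random_search(f, dim, low, high, evals, rng) for _ in range(runs)]
    return statistics.mean(results), statistics.stdev(results)


if __name__ == "__main__":
    print("sphere:", summarize(sphere, 30, -100.0, 100.0))
    print("rastrigin:", summarize(rastrigin, 30, -5.12, 5.12))
```

Replacing `random_search` with any optimizer that returns its best objective value gives the same kind of mean and standard deviation summary used for comparison.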

Lower bound of the DNA storage code set
The DNA code set with length $n$ and Hamming distance $d$ that meets the Hamming distance constraint, the GC content constraint and the No-runlength constraint is denoted $A^{GC,NL}(n, d, w)$. The results in Table 6 are lower bounds on the size of the code set satisfying the constraints for $4 \le n \le 10$ and $3 \le d \le n$. Any optimization algorithm requires a fitness function, so the LEO algorithm uses the sum of the Hamming distances, one of the constraints, as the fitness function for the constrained DNA coding process.
\[ \mathrm{fitness} = H(s, S_i) \qquad (13) \]
As shown in Table 6, we list the results obtained with the LEO algorithm and compare them with the best results from Li and Limbachiya [47]. The entries in bold represent the best solution under the same constraints, A represents the best result of Limbachiya and Li, and LEO represents the result of this paper. When n = 9 and d = 4, the DNA storage code set constructed by the LEO algorithm was 13.2% larger than the result in previous representative work. This is because the LEO algorithm uses the generation probability and the equilibrium pool mechanism to balance exploration and exploitation, and uses the Levy flight strategy in late iterations to jump out of local optima and approach the optimal solution more closely. The results of the EO stage provide a good initialization, and the equilibrium pool strategy with Levy flight further extends them. A larger set of DNA storage codes reduces the cost of a DNA storage system, since more information can be represented with codes of the same length. Better-quality DNA storage codes reduce the error rate in reading and writing and support the overall operation of the DNA storage system, and DNA as a storage medium is also a low-carbon storage method. Comparison with the results of Limbachiya and Li [47] shows that the LEO algorithm yields a clear advantage over the best of them in terms of coding. The altruistic algorithm is an intelligent algorithm based on a greedy strategy that removes the "worst" candidates in each iteration and iteratively removes potential codewords to obtain a code set that satisfies the requirements. As the algorithm repeats, it greedily removes the candidate with the maximum number of codewords within radius d-1 until the minimum distance of the code set is d.
However, altruistic algorithms based on greedy strategies do not consider global optimality and only construct a solution that is locally optimal in a specific sense. Similarly, the EORS algorithm has a random search phase that is expected to find more valid DNA codes through a greedy search strategy, but at the cost of increased time complexity. Therefore, in this work, we use the heuristic algorithm LEO. LEO improves the EO algorithm with Levy flight and has the advantages of fast convergence and high population diversity, which help the EO algorithm converge faster and find an approximately optimal solution.
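To make the notion of a lower bound concrete, the sketch below builds a code set by a simple lexicographic greedy scan: every sequence of length n that passes the GC content and No-runlength checks is added if it keeps Hamming distance at least d from all codes already chosen. This is a naive baseline written for illustration, not the LEO, altruistic or EORS algorithm, and the function names and the 40%-60% GC bounds are assumptions. The size of any set built this way is a valid lower bound, although usually not the best one.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def gc_ok(seq: str, low: float = 0.4, high: float = 0.6) -> bool:
    """GC content constraint: GC fraction within [low, high]."""
    return low <= sum(b in "GC" for b in seq) / len(seq) <= high


def no_run(seq: str) -> bool:
    """No-runlength constraint: no two adjacent bases are identical."""
    return all(x != y for x, y in zip(seq, seq[1:]))


def greedy_code_set(n: int, d: int) -> list[str]:
    """Lexicographic greedy construction of a constrained code set."""
    codes: list[str] = []
    for bases in product("ACGT", repeat=n):
        s = "".join(bases)
        if gc_ok(s) and no_run(s) and all(hamming(s, c) >= d for c in codes):
            codes.append(s)
    return codes


if __name__ == "__main__":
    codes = greedy_code_set(4, 3)
    print(len(codes), codes)
```

The scan enumerates all 4^n sequences, so it is only practical for short codes; heuristic search becomes attractive as n grows.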

Conclusion
This paper proposes the LEO algorithm for DNA storage coding under combinatorial constraints. By approximating the constrained DNA storage coding problem as a multi-objective optimization problem, the heuristic algorithm LEO is used to find an approximately optimal solution. This exploits the native advantages of heuristic algorithms for non-linear multi-objective optimization problems and brings the low complexity of constrained coding to the field of DNA storage coding. Codes that satisfy the constraints reduce the error rate in DNA synthesis and sequencing, as well as the probability of non-specific hybridization of DNA sequences during PCR. To assess the LEO algorithm, it was compared with several well-established algorithms on the benchmark functions, and the results show that LEO has clear advantages in AVE and SD, indicating the effectiveness of the improvement. A larger DNA code set was constructed under the same combinatorial constraints. The experiments show that in the majority of cases the coding scheme proposed in this paper improves on the best results of Li and Limbachiya, and the lower bound of the code set is raised, which also illustrates the performance of the LEO algorithm from the perspective of practical applications. Under the same constraints, the size of the DNA storage code set constructed by the LEO algorithm is increased by 4-13%. A larger code set can store more valid information in the same DNA length, reducing costs and improving read and write efficiency. This means that the same performance can be achieved with shorter codes, allowing DNA storage systems to store data more efficiently and competitively at a lower cost.
In future work, we will continue to focus on DNA storage coding and to study the existing problems of low coding efficiency, low coding quality and insufficient code set size. The intention is to achieve fully automated DNA storage as a powerful alternative to traditional silicon-based storage. In addition, the encryption and decryption of image and text information can be considered for the security of DNA storage, with the eventual goal of encrypted carbon-based devices that integrate storage and computing, similar to silicon-based computers.